基于文本的人检索旨在根据文本描述找到查询人员。关键是学习视觉文本模式之间的常见潜在空间映射。为了实现这一目标,现有的作品采用细分来获得明确的跨模式对齐方式或利用注意力来探索显着对准。这些方法有两个缺点:1)标记交叉模式比对很耗时。 2)注意方法可以探索显着的跨模式对齐,但可能会忽略一些微妙而有价值的对。为了缓解这些问题,我们为基于文本的人检索引入了一个隐式视觉文本(IVT)框架。与以前的模型不同,IVT利用单个网络来学习两种模式的表示形式,这有助于视觉文本相互作用。为了探索细粒的对准,我们进一步提出了两个隐式语义比对范式:多级比对(MLA)和双向掩码建模(BMM)。 MLA模块在句子,短语和单词级别上探索了更精细的匹配,而BMM模块旨在挖掘视觉和文本模态之间的\ textbf {更多}语义对齐。进行了广泛的实验,以评估公共数据集中提出的IVT,即Cuhk-Pedes,RSTPREID和ICFG-PEDES。即使没有明确的身体部位对准,我们的方法仍然可以达到最先进的表现。代码可在以下网址获得:https://github.com/tencentyouturesearch/personretrieval-ivt。
translated by 谷歌翻译
关于多模式情感分析的现有研究在很大程度上依赖文本方式,不可避免地会引起文本单词和情感标签之间的虚假相关性。这极大地阻碍了模型的概括能力。为了解决这个问题,我们定义了分发(OOD)多模式分析的任务。该任务旨在估计和减轻文本方式对强大概括的不良影响。为此,我们接受了因果推断,该因果通过因果图检查了因果关系。从图中,我们发现虚假相关性归因于文本模式对模型预测的直接影响,而间接相关性通过考虑多模式语义来更可靠。受此启发的启发,我们设计了一个模型不足的反事实框架,用于多模式情感分析,该框架通过额外的文本模型捕获文本模式的直接效果,并通过多模型估算间接模型。在推断期间,我们首先通过反事实推断估算直接效应,然后从所有模式的总效应中减去它以获得可靠预测的间接效应。广泛的实验显示了我们提出的框架的卓越有效性和概括能力。
translated by 谷歌翻译
我们介绍了MTG,这是一套新的基准套件,用于培训和评估多语言文本生成。它是具有最大人类通知数据(400K)的第一次传播的多语言多路文本生成数据集。它包括五种语言(英语,德语,法语,西班牙语和中文)的四代任务(故事产生,问题生成,标题生成和文本摘要)。Multiway设置可以启用跨语言和任务的模型测试知识传输功能。使用MTG,我们从不同方面训练和分析了几种流行的多语言生成模型。我们的基准套件通过更多的人为宣传的并行数据促进了模型性能增强。它提供了各种一代方案的全面评估。代码和数据可在\ url {https://github.com/zide05/mtg}上获得。
translated by 谷歌翻译
With an ever-growing number of parameters defining increasingly complex networks, Deep Learning has led to several breakthroughs surpassing human performance. As a result, data movement for these millions of model parameters causes a growing imbalance known as the memory wall. Neuromorphic computing is an emerging paradigm that confronts this imbalance by performing computations directly in analog memories. On the software side, the sequential Backpropagation algorithm prevents efficient parallelization and thus fast convergence. A novel method, Direct Feedback Alignment, resolves inherent layer dependencies by directly passing the error from the output to each layer. At the intersection of hardware/software co-design, there is a demand for developing algorithms that are tolerable to hardware nonidealities. Therefore, this work explores the interrelationship of implementing bio-plausible learning in-situ on neuromorphic hardware, emphasizing energy, area, and latency constraints. Using the benchmarking framework DNN+NeuroSim, we investigate the impact of hardware nonidealities and quantization on algorithm performance, as well as how network topologies and algorithm-level design choices can scale latency, energy and area consumption of a chip. To the best of our knowledge, this work is the first to compare the impact of different learning algorithms on Compute-In-Memory-based hardware and vice versa. The best results achieved for accuracy remain Backpropagation-based, notably when facing hardware imperfections. Direct Feedback Alignment, on the other hand, allows for significant speedup due to parallelization, reducing training time by a factor approaching N for N-layered networks.
translated by 谷歌翻译
The input and output of most text generation tasks can be transformed to two sequences of tokens and they can be modeled using sequence-to-sequence learning modeling tools such as Transformers. These models are usually trained by maximizing the likelihood the output text sequence and assumes the input sequence and all gold preceding tokens are given during training, while during inference the model suffers from the exposure bias problem (i.e., it only has access to its previously predicted tokens rather gold tokens during beam search). In this paper, we propose MoCa ({\bf Mo}mentum {\bf Ca}libration) for text generation. MoCa is an online method that dynamically generates slowly evolving (but consistent) samples using a momentum moving average generator with beam search and MoCa learns to align its model scores of these samples with their actual qualities. Experiments on four text generation datasets (i.e., CNN/DailyMail, XSum, SAMSum and Gigaword) show MoCa consistently improves strong pre-trained transformers using vanilla fine-tuning and we achieve the state-of-the-art results on CNN/DailyMail and SAMSum datasets.
translated by 谷歌翻译
The interaction and dimension of points are two important axes in designing point operators to serve hierarchical 3D models. Yet, these two axes are heterogeneous and challenging to fully explore. Existing works craft point operator under a single axis and reuse the crafted operator in all parts of 3D models. This overlooks the opportunity to better combine point interactions and dimensions by exploiting varying geometry/density of 3D point clouds. In this work, we establish PIDS, a novel paradigm to jointly explore point interactions and point dimensions to serve semantic segmentation on point cloud data. We establish a large search space to jointly consider versatile point interactions and point dimensions. This supports point operators with various geometry/density considerations. The enlarged search space with heterogeneous search components calls for a better ranking of candidate models. To achieve this, we improve the search space exploration by leveraging predictor-based Neural Architecture Search (NAS), and enhance the quality of prediction by assigning unique encoding to heterogeneous search components based on their priors. We thoroughly evaluate the networks crafted by PIDS on two semantic segmentation benchmarks, showing ~1% mIOU improvement on SemanticKITTI and S3DIS over state-of-the-art 3D models.
translated by 谷歌翻译
Multi-agent artificial intelligence research promises a path to develop intelligent technologies that are more human-like and more human-compatible than those produced by "solipsistic" approaches, which do not consider interactions between agents. Melting Pot is a research tool developed to facilitate work on multi-agent artificial intelligence, and provides an evaluation protocol that measures generalization to novel social partners in a set of canonical test scenarios. Each scenario pairs a physical environment (a "substrate") with a reference set of co-players (a "background population"), to create a social situation with substantial interdependence between the individuals involved. For instance, some scenarios were inspired by institutional-economics-based accounts of natural resource management and public-good-provision dilemmas. Others were inspired by considerations from evolutionary biology, game theory, and artificial life. Melting Pot aims to cover a maximally diverse set of interdependencies and incentives. It includes the commonly-studied extreme cases of perfectly-competitive (zero-sum) motivations and perfectly-cooperative (shared-reward) motivations, but does not stop with them. As in real-life, a clear majority of scenarios in Melting Pot have mixed incentives. They are neither purely competitive nor purely cooperative and thus demand successful agents be able to navigate the resulting ambiguity. Here we describe Melting Pot 2.0, which revises and expands on Melting Pot. We also introduce support for scenarios with asymmetric roles, and explain how to integrate them into the evaluation protocol. This report also contains: (1) details of all substrates and scenarios; (2) a complete description of all baseline algorithms and results. Our intention is for it to serve as a reference for researchers using Melting Pot 2.0.
translated by 谷歌翻译
Various depth estimation models are now widely used on many mobile and IoT devices for image segmentation, bokeh effect rendering, object tracking and many other mobile tasks. Thus, it is very crucial to have efficient and accurate depth estimation models that can run fast on low-power mobile chipsets. In this Mobile AI challenge, the target was to develop deep learning-based single image depth estimation solutions that can show a real-time performance on IoT platforms and smartphones. For this, the participants used a large-scale RGB-to-depth dataset that was collected with the ZED stereo camera capable to generated depth maps for objects located at up to 50 meters. The runtime of all models was evaluated on the Raspberry Pi 4 platform, where the developed solutions were able to generate VGA resolution depth maps at up to 27 FPS while achieving high fidelity results. All models developed in the challenge are also compatible with any Android or Linux-based mobile devices, their detailed description is provided in this paper.
translated by 谷歌翻译
在交互环境中学习操纵3D对象一直是强化学习(RL)的挑战性问题。特别是,很难训练可以概括具有不同语义类别,多样形状几何形状和多功能功能的对象的策略。最近,视觉负担能力的技术在提供有效的可操作语义方面提供了以对象为中心的信息先验的前景。因此,可以通过知道如何在手柄上施加力来训练有效的政策来打开门。但是,要学习负担能力,它通常需要人为定义的动作基础,这限制了适用的任务范围。在这项研究中,我们通过使用RL训练过程中生成的联系信息来预测感兴趣的接触图,利用视觉负担。然后,这种联系预测过程会导致一个端到端的负担能力学习框架,该框架可以概括不同类型的操纵任务。令人惊讶的是,这种框架的有效性即使在多阶段和多代理场景下也具有。我们对八种类型的操纵任务进行了测试。结果表明,我们的方法优于基线算法,包括基于视觉的负担方法和RL方法,其成功率很大。演示可以在https://sites.google.com/view/rlafford/上找到。
translated by 谷歌翻译
即使面对分布(OOD)样本,也必须信任机器学习方法在现实世界环境中做出适当的决定。当前的许多方法只是旨在检测OOD示例并在给出未识别的输入时提醒用户。但是,当OOD样本与训练数据显着重叠时,二进制异常检测是无法解释或解释的,并且很少向用户提供信息。我们提出了一个新的OOD检测模型,随着输入变得更加模棱两可,在不同水平的粒度水平上进行预测,模型预测变得更加粗糙,更保守。考虑一个遇到未知鸟类和汽车的动物分类器。两种情况都是OOD,但是如果分类器认识到其对特定物种的不确定性太大并预测鸟类而不是将其视为OOD,则用户获得了更多信息。此外,我们在层次结构的每个级别上诊断了分类器的性能,以改善模型预测的解释性和解释性。我们证明了分层分类器对细粒和粗粒的OOD任务的有效性。
translated by 谷歌翻译